ABSTRACT

Can one estimate the number of remaining faults in a soft- 
ware system? A credible estimation technique would be im- 
mensely useful to project managers as well as customers. 
It would also be of theoretical interest, as a general law of 
software engineering. We investigate possible answers in the 
context of automated random testing, a method that is in- 
creasingly accepted as an effective way to discover faults. 
Our experimental results, derived from best-fit analysis of a 
variety of mathematical functions, based on a large number 
of automated tests of library code equipped with automated 
oracles in the form of contracts, suggest a poly-logarithmic 
law. Although further confirmation remains necessary on 
different code bases and testing techniques, we argue that 
understanding the laws of testing may bring significant ben- 
efits for estimating the number of detectable faults and com- 
paring different projects and practices. 

1. INTRODUCTION 

A scientific discipline is characterized by general laws, 
such as Maxwell's equations. Where the topics of discourse 
involve human phenomena, and belong to engineering rather 
than science, the laws cannot be absolute truths compara- 
ble to the laws of nature, and often involve a probabilistic 
element; but they should still describe properties and con- 
straints that govern all or almost all instances of the phe- 
nomena under consideration. 

Software engineering is particularly in need of such laws. 
Some useful ones have already been identified, in areas such 
as project management; an example [5] [12] is the observation,
first expressed by Barry Boehm on the basis of his
study of a large database of projects, that every software 
project has a nominal cost, deducible from the project's 
overall characteristics, from which it is not possible to devi- 
ate - whether by giving the project more time or by includ- 
ing more developers - by more than about 25%. 

One area in which such a general law would be of particular
interest is testing. It has been well known since at least
Lehman and Belady's seminal work on faults ("bugs") in
successive IBM OS 360 releases [4] that if a project tracks 
faults in a sufficiently diligent way the evolution of the num- 
ber of faults in successive releases follows regular patterns. 
Is it possible to turn this general, informal observation into 
a precise law of software testing, on which project managers 
and developers could rely to estimate, from the number of 
faults uncovered so far through testing, the number of de- 
tectable faults in an individual module or an entire system? 
This is the question that we set out to explore, and on which 
we are reporting our first results. 

1.1 Expected benefits 

Were it available, and backed by experimental evidence 
covering a broad enough base of projects and environments 
to make it credible, a law describing the rate of fault detec- 
tion would be of considerable interest to the industry. 

An immediate benefit would be to help project managers 
answer one of the most common and also toughest questions 
they face: can we ship yet? A release should be shipped
early enough to avoid missing a market opportunity or be- 
ing scooped by the competition; but not too early if this 
means that so many faults remain as to provoke an adverse 
reaction from the market. Another application, for indi- 
vidual developers, is to estimate the amount of testing that 
remains necessary for a module or a subsystem. Yet another 
benefit would be to help assess how a project's or organiza- 
tion's fault patterns differ, for better or worse, from average 
industry practices. Such an assessment would be particu- 
larly appropriate in organizations applying strict guidelines, 
for example as part of CMMI [8]. Also of interest is the
purely intellectual benefit of gaining insights into the na- 
ture of software development, in the form of a general law of 
software engineering that describes fault patterns and might 
help in the effort to avoid faults in the first place. 

1.2 Automatic testing 

In the present work, we set out to determine through em- 
pirical study if a general law of testing exists, predicting 
the number of detectable faults. The context of the study 
is automatic testing, in which test cases are not manually 
prepared by humans but generated automatically through 
appropriate algorithms. More specifically, we rely on auto- 
mated random testing, where these algorithms use partly 
random techniques. Once dismissed as ineffective [14], random
testing has shown itself, in recent years, to be a viable
technique; such tools as AutoTest [13] for Eiffel and
Randoop [19] and YETI [17] for Java and .NET, which apply
automatic random testing, succeed by the criterion that
matters most in practice: finding faults in programs, includ- 
ing released libraries and systems. 

With automated testing, the user selects a number of mod- 
ules - in object-oriented programming, classes - for testing; 
the testing tool then creates objects (class instances) and 
performs method calls on them with arguments selected ac- 
cording to a strategy that involves a random component. It 
then records any abnormal behavior (failure) that may be 
signaling a fault in the software. 

The work described here relies on automated random test- 
ing as just described. This approach has the advantage of 
simplicity, and of not relying on hard-to-control human ac- 
tions. Another advantage is reproducibility (up to the vari- 
ation caused by the choice of seeds in random number gen- 
eration). It also has the disadvantage of limiting the gen- 
eralizability of our findings (Section 3.2); but it is hardly 
surprising that a general empirical understanding of test- 
ing is likely to require much more investigation and varied 
experiments. 

The paper's main experiments involve software equipped 
with contracts (specification elements embedded in the soft- 
ware and subject to run-time evaluation, such as pre- and 
postconditions and class invariants); in such a context, the 
testing process focuses on generating failures due to contract 
violations. Since contracts are meant to express the specifi- 
cation of the methods and classes being tested, they provide 
testing oracles [23] to reliably identify faults (as opposed 
to mere failures). The bulk of our experiments (Section 2.2)
target code with contracts and study the evolution of found
faults over time. Therefore, the goal is to obtain experimen- 
tally a general law T(t), where t is the testing time, mea- 
sured in number of drawn test cases and T(t) is the number 
of unique faults uncovered by the tests up to time t. 

If contracts are not available, a fully automated approach 
must rely on failures such as arithmetic overflow or running 
out of memory. In these cases, a methodological issue arises 
in practice since automated testing finds failures rather than 
faults; obtaining T requires a sound policy to determine 
whether two given failures relate to the same fault. We hint 
at some ways to address this problem in Section 2.3.

Extended version. More detailed data and graphs are 
available in an extended version of the paper at: 

Appendix: Section [6] of this paper. 

2. EXPERIMENTS 

The experiments run repeated random testing sessions 
on Eiffel classes (Section 2.2) with AutoTest, and on Java
classes (Section 2.3 ) with YETI. Each testing session is char- 
acterized by: 

• c: the tested class; 

• T: the number of rounds in the simulation, that is the 
number of test cases (valid and invalid) drawn; 

• $\phi : [0..T] \to \mathbb{N}$: the function counting the cumulative
number of unique failures or faults after each round. 

The dataset $D_c$ for a class $c$ consists of a collection of counting
functions $\phi_1, \phi_2, \ldots, \phi_{S_c}$, where $S_c$ is the total number
of testing sessions on $c$. In our experiments, all source code



was tested "as is", without injecting any artificial error or 
modifying it in any way. 

Given a dataset $D_c$, consider the mean function of the
fault count:

$$\phi_{\mathrm{mean}}(k) = \frac{1}{S_c} \sum_{1 \le i \le S_c} \phi_i(k)$$

as well as the median $\phi_{\mathrm{median}}(k)$. We considered nine model
functions, described in Section 2.1 and suggested by various
sources, and fit each model function to each of $\phi_{\mathrm{mean}}(k)$
and $\phi_{\mathrm{median}}(k)$ from the experimental data. We used Matlab
2011 for all data analysis and fitting computations. The
following sections illustrate the main results.
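
As an illustration of how the mean and median curves are obtained from a dataset, here is a minimal sketch that aggregates several counting functions point by point. The three sessions are made-up numbers, and the function names are assumptions chosen for this example.

```python
from statistics import mean, median

def mean_curve(sessions):
    """Pointwise mean of the counting functions, one value per round k."""
    return [mean(column) for column in zip(*sessions)]

def median_curve(sessions):
    """Pointwise median of the counting functions, one value per round k."""
    return [median(column) for column in zip(*sessions)]

# Three hypothetical sessions on the same class, each with T = 4 rounds;
# entry k is the cumulative number of unique faults after k test cases.
sessions = [
    [0, 1, 1, 2, 2],
    [0, 0, 1, 1, 3],
    [0, 0, 0, 1, 1],
]
phi_mean = mean_curve(sessions)
phi_median = median_curve(sessions)
```

All sessions must have the same length for the pointwise aggregation to be meaningful, since `zip` silently truncates to the shortest one.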

2.1 Model functions 

We consider a number of "reasonable" fitting functions, 
suggested either by intuition or by the theoretical model 
discussed in Section 3.1. Some of the functions are special
cases of other functions; nonetheless, it is advisable to try to 
fit both - special and general case - because the performance 
of the fitting algorithms may be sensitive to the form in 
which a function is presented. In all the examples, lower- 
and upper-case names other than the argument x denote 
parameters to be instantiated by fitting. 

Following an original intuition of analogy with biological
phenomena - suggested in related work [16] - we consider
the Michaelis-Menten equation:

$$\Phi_1(x) = \frac{a x}{x + B} \tag{1}$$

as well as its generalizations into a rational function of third
degree:

$$\Phi_2(x) = \frac{a x^3 + b x^2 + c x + d}{A x^3 + B x^2 + C x + D} \tag{2}$$

and a rational function of arbitrary degree:

$$\Phi_3(x) = \frac{a x^b + c}{A x^B + C} \tag{3}$$
The model analyzed in Section 3.1 suggests including functional
forms similar to polynomials, as well as logarithms
and exponentials. Hence, we consider logarithms to any
power:

$$\Phi_4(x) = a \log^b(x + 1) + c \tag{4}$$

and poly-logarithms of third degree:

$$\Phi_5(x) = a \log^3(x + 1) + b \log^2(x + 1) + c \log(x + 1) + d \tag{5}$$

where the translation coefficient $+1$ is needed because the
fault function $\phi$ is such that $\phi(0) = 0$, but $\log(0) = -\infty$.
The base of the logarithm is immaterial because the multiplicative
parameters can accommodate any base. We also consider
exponentials of powers:



$$\Phi_6(x) = a\, b^{x^c} + d \tag{6}$$



Polynomials up to degree three are covered by $\Phi_2$, but we
still consider some special cases explicitly; third degree:

$$\Phi_7(x) = a x^3 + b x^2 + c x + d \tag{7}$$

arbitrary degree:

$$\Phi_8(x) = a x^b + c \tag{8}$$

and third degree with negative powers:

$$\Phi_9(x) = a x^{-3} + b x^{-2} + c x^{-1} + d \tag{9}$$
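
For readers who want to experiment with the candidate models, three of them can be written as plain functions. This is a minimal sketch: `phi1` assumes the standard Michaelis-Menten form $a x / (x + B)$, and the function and parameter names are chosen for this example only.

```python
import math

def phi1(x, a, B):
    """Michaelis-Menten curve: starts at 0 and saturates at a as x grows."""
    return a * x / (x + B)

def phi4(x, a, b, c):
    """Logarithm raised to an arbitrary power b, shifted so that x = 0 is allowed."""
    return a * math.log(x + 1) ** b + c

def phi5(x, a, b, c, d):
    """Poly-logarithm of third degree in log(x + 1)."""
    L = math.log(x + 1)
    return a * L ** 3 + b * L ** 2 + c * L + d
```

The qualitative difference between the families is visible directly: `phi1` has a horizontal asymptote at `a`, whereas `phi5` with positive leading coefficient keeps growing, although very slowly.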
2.2 Experiments with Eiffel code

Experiments with Eiffel used AutoTest [13]; they targeted
42 Eiffel classes, from the widely used open-source data
structure libraries EiffelBase [10] and Gobo [11]. Table 1
reports the statistics about the testing sessions with AutoTest
on these classes; for each class, the table lists the
number of testing sessions ($S$), the test cases sampled per
session ($T$), the maximum number of unique faults found
($F$, corresponding to unique contract violations), the mean
sampled standard deviation and skewness ($E[\sigma]$, $E[\gamma]$) of
the faults found across the class's multiple testing sessions,
and the mean and standard deviation ($E[\Delta]$, $\sigma[\Delta]$) of the
new faults found with every test case sampled (thus, for
example, $E[\Delta] = 10^{-3}$ means that a new fault is found
every thousand test cases on average). The bottom rows
display mean, median, and standard deviation of the values
in each column. The data was collected according to the
suggested guidelines for empirical evaluation of randomized
algorithms [2]; in particular, the very large number of drawn
test cases makes the averaged data robust with respect to
statistical fluctuations.

2.2.1 Fitting results

Table 2 reports the result of fitting the models $\Phi_1$-$\Phi_9$
on the mean fault count $\phi_{\mathrm{mean}}(k)$ curves; for each class, column
Best Fit Ranking ranks the $\Phi_i$'s from the best fitting
to the worst, according to the coefficient of determination
$R^2$ (the higher, the better), and the root mean squared error
RMSE (the smaller, the better); the rankings according
to the two measures agreed in all experiments. The other
columns report the $R^2$ and RMSE scores of the best fit, and
the absolute value of the difference between such scores for
the best fit and the same scores for $\Phi_5$. The data shows that
$\Phi_5$ emerges as consistently better than the other functions;
as the bottom of the table summarizes, $\Phi_5$ is the best fit in
69% and one of the top two fits in 90% of the cases; when
it is not the best, it fares within 1% of the best in terms
of scores. A Wilcoxon signed-rank test confirms that the
observed differences between $\Phi_5$ and the other functions are
highly statistically significant: for example, a comparison of
the $R^2$ values indicates that the values for $\Phi_5$ are different
(smaller) with high probability ($1.65 \cdot 10^{-8} < p < 7.14 \cdot 10^{-7}$)
and large effect size (between 0.54 and 0.62, computed using
the Z-statistics). The same data with respect to the median
curves is qualitatively the same as with the mean; hence we
do not report it in detail.

In a few cases, models other than $\Phi_5$ achieve a higher
fitting score even though visual inspection of the curves shows that
the fit is obviously unsatisfactory. This behavior is
arguably due to numerical errors; there are, however, a
few cases of fits with high scores but visibly unsatisfactory,
whose fitting process converged without Matlab signaling
any numerical approximation problem. In any case, visual
inspection confirms that $\Phi_5$ is a visibly proper fit to $\phi$ in all
cases.
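
The two goodness-of-fit scores used for the ranking can be sketched as follows. This is an illustrative implementation with made-up data; it uses the plain definition of RMSE that divides by the number of points, whereas some fitting tools divide by the residual degrees of freedom instead, so absolute values may differ slightly from tool output.

```python
import math

def r_squared(ys, fitted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def rmse(ys, fitted):
    """Root mean squared error over all data points."""
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(ys, fitted)) / len(ys))

def rank_models(ys, fits):
    """Order model names from best to worst fit by decreasing R^2."""
    return sorted(fits, key=lambda name: r_squared(ys, fits[name]), reverse=True)

# Hypothetical observed curve and the values predicted by three fitted models.
ys = [0, 1, 2, 3]
fits = {
    "exact": [0, 1, 2, 3],
    "off": [0, 1, 2, 4],
    "flat": [1.5, 1.5, 1.5, 1.5],
}
ranking = rank_models(ys, fits)
```

Because both scores are computed from the same residuals on the same data, ranking by increasing RMSE gives the same order here as ranking by decreasing $R^2$.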

2.2.2 Poly-logarithmic fits 

Given that the evidence points towards a
poly-logarithmic law, the question remains of which polynomial
degree achieves the best fit. Consider seven poly-logarithmic
model functions $\mathcal{L}_k$, $1 \le k \le 7$, of various
degrees. $\mathcal{L}_1$-$\mathcal{L}_5$ have a number of
coefficients equal to their degree plus one:

$$\mathcal{L}_k(x) = c + \sum_{1 \le j \le k} c_j \log^j(x + 1), \quad 1 \le k \le 5 \tag{10}$$

$\mathcal{L}_6$ and $\mathcal{L}_7$ are, instead, binomials whose degree is a parameter;
namely, $\mathcal{L}_6$ equals $\Phi_4$ and:

$$\mathcal{L}_7(x) = a \log^{1/b}(x + 1) + c \tag{11}$$

The results show that it is always the case that $\mathcal{L}_5 >
\mathcal{L}_4 > \mathcal{L}_3 > \mathcal{L}_2 > \mathcal{L}_1$, that is the higher the degree the better
the fit; in hindsight, this is an unsurprising consequence
of the fact that models of higher degree offer more degrees
of freedom, hence they are more flexible. $\mathcal{L}_3 = \Phi_5$, however,
still fares quite well on average; the few cases where its
performance is as much as 10% worse than $\mathcal{L}_5$ correspond
to more irregular curves with fewer faults and hence fewer
new datapoints, such as for class ARRAYED_QUEUE. Finally,
$\mathcal{L}_6$ and $\mathcal{L}_7$ occasionally rank third; whenever this happens,
however, the difference with the best (and with $\Phi_5$) is
negligible. In all, $\Phi_5$ seems to be a reasonable compromise
between flexibility and economy, but poly-logarithmic functions
of higher order could be considered when useful. Notice
that, even if in principle we could obtain enough degrees of
freedom with any function of arbitrarily high degree, only
poly-logarithmic works for reasonable degrees; for example,
polynomials of fifth degree (which generalize $\Phi_7$) never provide
good fits.
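
Since a poly-logarithm of fixed degree is linear in its coefficients, it can be fitted by ordinary linear least squares. The sketch below solves the normal equations with Gaussian elimination and compares degrees 1 to 3 on synthetic data. It is an illustration only: it is not the Matlab procedure used in the experiments, the data is made up, and all function names are assumptions.

```python
import math

def fit_polylog(xs, ys, k):
    """Least-squares coefficients [c, c_1, ..., c_k] of c + sum_j c_j * log(x + 1)**j."""
    rows = [[math.log(x + 1) ** j for j in range(k + 1)] for x in xs]
    n = k + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(rows, ys))] for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - rest) / m[i][i]
    return coef

def eval_polylog(coef, x):
    """Evaluate the fitted poly-logarithm at x."""
    L = math.log(x + 1)
    return sum(c * L ** j for j, c in enumerate(coef))

def sse(xs, ys, coef):
    """Sum of squared residuals of the fit."""
    return sum((y - eval_polylog(coef, x)) ** 2 for x, y in zip(xs, ys))

# Synthetic sample points, evenly spaced in log(x + 1); not data from the paper.
xs = [math.exp(t / 10) - 1 for t in range(60)]
ys = [math.sqrt(x) for x in xs]
errors = [sse(xs, ys, fit_polylog(xs, ys, k)) for k in (1, 2, 3)]
```

The residual error cannot increase with the degree, because each model contains the lower-degree ones as special cases. The normal equations become ill-conditioned for high degrees, so a QR-based solver would be a safer choice beyond small values of `k`.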

2.3 Experiments with Java code 

With the goal of confirming (or invalidating) the results of 
the Eiffel experiments, we used YETI to test 9 Java classes 
from java.util and 29 classes from java.lang. For lack of 
space, we only present a summary of the results and high- 
light the differences in comparison with Eiffel and the ques- 
tions about the comparison that remain open. 

Unlike in Eiffel, where the code has contracts, interpreting 
a Java failure to know whether it is a "real" fault cannot be 
done in a fully automated way in normal Java code without 
contracts. To address this point at least in part, YETI col- 
lects all the exceptions triggered during random testing and 
considers the following exceptions as not provoked by a fault, 
but merely accidents of the testing process: (1) declared exceptions,
including RuntimeException, as client code knows
that they may be thrown and may even be part of a code
pattern; (2) InvalidArgumentException, comparable to precondition
violations. This still falls short of a complete identification
of real faults, but it helps to reduce the number of
spurious failures.

Given this filtering performed by YETI, the experiments
with the 38 Java classes are in two parts. The first part targets
the 9 java.util classes and 2 classes from java.lang,
and analyzes all unique failures filtered by YETI as described
above (two failures are the same if they generate
the same exception stack trace). The second part targets
the 29 classes from java.lang and only reports failures manually
pruned by discarding all failures that reflect behavior
compatible with the informal documentation (for example,
division by zero when the informal API documentation requires
an argument to be non-zero). In this case, we get to a
fairly reasonable approximation of real faults; hence we will
refer to the first part of Java experiments as "failures" and
to the second as "faults".
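
The filtering and de-duplication policy described above can be sketched as follows. The `Failure` record, the `IGNORED` set, and `unique_failures` are hypothetical names introduced for this illustration; they do not reproduce YETI's actual data structures or API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Failure:
    exception: str   # name of the exception class that was thrown
    declared: bool   # True if the called method declares this exception
    trace: tuple     # stack frames of the exception, innermost first

# Exceptions treated like precondition violations rather than faults.
IGNORED = {"InvalidArgumentException"}

def unique_failures(failures):
    """Drop filtered exceptions, then keep one failure per distinct stack trace."""
    seen = set()
    unique = []
    for f in failures:
        if f.declared or f.exception in IGNORED:
            continue                      # accident of the testing process
        key = (f.exception, f.trace)      # same trace means same failure
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique

# Hypothetical log of a short testing session.
log = [
    Failure("NullPointerException", False, ("A.f", "A.g")),
    Failure("NullPointerException", False, ("A.f", "A.g")),   # duplicate trace
    Failure("IOException", True, ("B.read",)),                # declared
    Failure("InvalidArgumentException", False, ("C.set",)),   # ignored
    Failure("ArithmeticException", False, ("D.div",)),
]
kept = unique_failures(log)
```

Comparing whole stack traces is a conservative notion of sameness: two failures caused by the same underlying fault but reached through different call paths are still counted separately.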
Table 1: Eiffel classes statistics.

Class S T F E[σ] E[γ] E[Δ] σ[Δ]
…
ARRAY … 69 … 9.02E-05
ARRAYED_CIRCULAR 15 7.77E+05 52 … 2.75E-04
ARRAYED_LIST 14 1.25E+06 24 …
ARRAYED_QUEUE … 6 0.3742 …
ARRAYED_SET 15 … 15 1.0823 …
BINARY_TREE … 2.25E+06 40 … 7.56E-05 … 14 …
COMPACT_CURSOR_TREE 14 5.56E+05 67 … 3.35E-04
DS_ARRAYED_LIST 12 2.07E+06 91 7.0082 … 6.65E-05
DS_AVL_TREE 15 5.10E+05 16 … -1.82E+00 …
DS_BILINKED_LIST 14 2.70E+06 … 2.2035 …
DS_BINARY_SEARCH_TREE … 15 … 9.16E-05 … 12 … 5.18E-04 1.67E-04
DS_HASH_SET 14 1.86E+06 19 … 6.31E-04 1.83E-04
DS_HASH_TABLE 14 1.66E+06 25 2.0158 … 1.07E-03 1.88E-04
DS_LEFT_LEANING_RED_BLACK_TREE 14 7.18E+05 33 … -3.45E+00 4.32E-03 1.73E-04
DS_LINKED_LIST 14 2.17E+06 94 7.7943 -2.67E+00 3.88E-03 1.88E-04
DS_LINKED_QUEUE 14 2.23E+06 9 1.3880 -4.98E-01 3.05E-04 5.89E-05
DS_LINKED_STACK 14 2.22E+06 10 1.4648 -5.39E-01 3.57E-04 6.23E-05
DS_MULTIARRAYED_HASH_SET 14 2.61E+06 22 3.7948 … 7.61E-04 5.60E-05
DS_RED_BLACK_TREE 15 … 32 …
FIXED_TREE 13 7.54E+05 49 … 1.67E-04
HASH_TABLE 14 … 14 … 15 … 14 …
LINKED_DFA 13 2.72E+06 135 12.9783 -1.97E+00 4.67E-03 1.69E-04
LINKED_LIST 14 3.41E+06 21 1.4436 -2.84E+00 5.30E-04 3.89E-05
LINKED_PRIORITY_QUEUE 14 1.20E+06 23 2.1477 -2.34E+00 1.80E-03 6.35E-05
LINKED_SET 14 2.59E+06 29 2.1800 -3.45E+00 1.03E-03 4.48E-05
… 15 …
SUBSET_STRATEGY_TREE 15 3.08E+06 20 1.8972 -2.87E+00 5.84E-04 3.47E-05
TWO_WAY_CIRCULAR 13 1.53E+06 41 4.3326 -1.88E+00 2.46E-03 1.37E-04
TWO_WAY_CURSOR_TREE 13 1.35E+06 50 4.5548 -1.86E+00 3.56E-03 1.11E-04
TWO_WAY_LIST 15 2.01E+06 22 1.7119 -3.45E+00 9.61E-04 6.42E-05
TWO_WAY_SORTED_SET 14 3.09E+06 47 4.2784 -3.43E+00 1.47E-03 3.29E-05
TWO_WAY_TREE 14 2.46E+06 163 28.6749 -1.02E+00 6.27E-03 1.77E-04
Mean 14 1.93E+06 40 4.3299 -3.55E+00 2.54E-03 1.15E-04
Median 14 2.04E+06 31 3.1640 -2.42E+00 1.49E-03 9.09E-05
Stdev 1 9.04E+05 36 5.3955 4.08E+00 2.49E-03 9.59E-05


Tables 3 and 4 report the summary statistics about the testing
sessions with YETI on all classes, with the same data as for
Eiffel (see Section 2.2). The data for faults, where manual
pruning eliminated the vast majority of failures as spurious, 
has far fewer significant datapoints. Skewness is not com- 
putable when there are no faults found, which happens for 
15 classes; hence the summary statistics about skewness are 
immaterial (NaN stands for "Not a Number"). 

Table 3: Java classes tested for failures: statistics.

Summary S T F E[σ] E[γ] E[Δ] σ[Δ]
Mean 30 1.23E+05 32 3.9044 -8.02E+00 1.03E-02 1.86E-04
Median 30 1.18E+05 15 1.7799 -6.10E+00 7.16E-03 6.68E-05
Stdev 1.65E+04 29 3.8803 4.70E+00 8.86E-03 2.63E-04

Table 4: Java classes tested for faults: statistics.

Summary S T F E[σ] E[γ] E[Δ] σ[Δ]
Mean 4 4.68E+05 2 0.2356 NaN 1.73E-03 2.51E-05
Median 4 2.61E+05 0.0000 NaN 0.00E+00 0.00E+00
Stdev 1 9.45E+05 3 0.4094 NaN 2.72E-03 1.26E-04



Fitting results 

We tried to fit the models $\Phi_1$, $\Phi_2$, $\Phi_4$, $\Phi_5$ on the mean fault
count $\phi_{\mathrm{mean}}(k)$ curves, for both Java failures and faults. The
data is somewhat harder to fit than in the Eiffel experiments,
and in fact we had to exclude the other models because they
could produce converging fits in only a fraction of the experiments.
In comparison with the Eiffel data, there is a
detectable overall difference: $\Phi_2$ and $\Phi_1$ seem to emerge as
the best models, whereas $\Phi_5$ is best only in a limited number
of cases. Looking at the $R^2$ scores, $\Phi_1$ and $\Phi_2$ alternate
as best model for the failure curves, and their difference is
often small ($p = 0.067$ and effect size 0.398); for the fault
curves, $\Phi_1$ ranks first in the majority of classes, but its difference
w.r.t. $\Phi_2$ is statistically insignificant ($p = 0.5$ and effect
size 0.095). A common phenomenon is that the majority of
curves $\phi_{\mathrm{mean}}(k)$ with YETI show a horizontal asymptote,
which $\Phi_1$ or $\Phi_2$ can accommodate much better than $\Phi_5$ can.

A closer look suggests that the differences w.r.t. the Eiffel
experiments may still be reconcilable. First, in the majority
of cases of failures, $\Phi_5$ fits to within 5% of the
best model; the only exceptions are the java.util classes
ArrayList, Hashtable, and LinkedList where visual inspection
shows curves with a steeper initial phase that $\Phi_2$ fits best,
and where $\Phi_1$ and $\Phi_5$ behave similarly. For the fault experiments,
the picture is more varied; but $\Phi_5$ fits to within
20% of the best models in all classes but three; and the quality
of fitting a certain model is much more varied because of
the small number of faults found in these experiments (but
we excluded from these statistics the classes with no faults
found). Finally, the difference between $\Phi_2$ and $\Phi_5$ is often
statistically insignificant; the difference between $\Phi_1$ and $\Phi_5$
is significant ($p \simeq 0.2$) but not large for faults (effect size
0.3); in contrast, the difference between $\Phi_5$ and $\Phi_4$ is highly
statistically significant and large ($p < 10^{-3}$ and effect size
from 0.43 to 0.62). In all, further experiments are necessary
Table 2: Testing of Eiffel classes: best fits with mean.

Class | Fit Ranking | R² (best) | RMSE (best) | R² (Δ best) | RMSE (Δ best)
ACTIVE_LIST | 2 5 4 1 9 6 8 3 7 | 0.9825 | 0.1304 | 0.0108 | -0.0353
ARRAY | 5 4 1 9 8 6 3 7 2 | 0.9894 | 0.5151 | 0.0000 | 0.0000
ARRAYED_CIRCULAR | 5 3 4 8 1 2 9 6 7 | 0.9978 | 0.2461 | 0.0000 | 0.0000
ARRAYED_LIST | 3 5 4 1 8 9 6 7 2 | 0.9877 | 0.1485 | 0.0202 | -0.0932
ARRAYED_QUEUE | 8 5 4 3 1 9 2 6 7 | 0.7299 | 0.1547 | 0.0027 | -0.0008
ARRAYED_SET | 5 4 1 2 8 9 6 3 7 | 0.9550 | 0.2078 | 0.0000 | 0.0000
BINARY_TREE | 3 5 4 8 1 7 9 6 2 | 0.9987 | 0.1943 | 0.0030 | -0.1566
BOUNDED_QUEUE | 5 4 1 3 2 9 6 8 7 | 0.8048 | 0.1043 | 0.0000 | 0.0000
COMPACT_CURSOR_TREE | 5 4 3 8 1 2 9 6 7 | 0.9830 | 0.7254 | 0.0000 | 0.0000
DS_ARRAYED_LIST | 5 4 1 8 2 9 6 3 7 | 0.9860 | 0.8201 | 0.0000 | 0.0000
DS_AVL_TREE | 5 4 8 3 1 2 9 6 7 | 0.9861 | 0.2299 | 0.0000 | 0.0000
DS_BILINKED_LIST | 5 4 3 1 2 9 6 8 7 | 0.9947 | 0.1528 | 0.0000 | 0.0000
DS_BINARY_SEARCH_TREE | 5 3 4 8 1 2 9 6 7 | 0.9913 | 0.1618 | 0.0000 | 0.0000
DS_BINARY_SEARCH_TREE_SET | 8 4 5 3 2 1 9 6 7 | 0.8727 | 0.1145 | 0.0912 | -0.0355
DS_HASH_SET | 8 5 4 3 1 2 9 6 7 | 0.9782 | 0.2085 | 0.0021 | -0.0097
DS_HASH_TABLE | 8 5 4 3 1 9 6 7 2 | 0.9653 | 0.3536 | 0.0053 | -0.0258
DS_LEFT_LEANING_RED_BLACK_TREE | 5 4 3 1 8 9 6 7 2 | 0.9888 | 0.3015 | 0.0000 | 0.0000
DS_LINKED_LIST | 5 4 8 1 2 9 6 3 7 | 0.9905 | 0.7505 | 0.0000 | 0.0000
DS_LINKED_QUEUE | 8 4 5 7 3 1 9 6 2 | 0.9940 | 0.0986 | 0.0094 | -0.0595
DS_LINKED_STACK | 8 3 4 5 1 9 6 2 7 | 0.9914 | 0.1265 | 0.0077 | -0.0480
DS_MULTIARRAYED_HASH_SET | 5 8 4 3 7 1 2 9 6 | 0.9852 | 0.4465 | 0.0000 | 0.0000
DS_RED_BLACK_TREE | 5 4 1 9 6 8 3 2 7 | 0.9915 | 0.2625 | 0.0000 | 0.0000
FIXED_TREE | 5 4 1 8 2 9 6 3 7 | 0.9920 | 0.4335 | 0.0000 | 0.0000
HASH_TABLE | 8 5 4 3 7 1 2 9 6 | 0.9979 | 0.1621 | 0.0010 | -0.0353
LINKED_AUTOMATON | 1 2 5 9 6 4 8 3 7 | 0.9693 | 0.0669 | 0.2517 | -0.1361
LINKED_CIRCULAR | 5 3 4 1 8 2 9 6 7 | 0.9982 | 0.1774 | 0.0000 | 0.0000
LINKED_CURSOR_TREE | 5 8 4 3 1 9 6 7 2 | 0.9904 | 0.4413 | 0.0000 | 0.0000
LINKED_DFA | 5 4 8 3 1 2 9 6 7 | 0.9954 | 0.8751 | 0.0000 | 0.0000
LINKED_LIST | 3 5 4 8 1 2 9 6 7 | 0.9678 | 0.2441 | 0.0005 | -0.0017
LINKED_PRIORITY_QUEUE | 5 4 3 8 1 2 9 6 7 | 0.9866 | 0.2391 | 0.0000 | 0.0000
LINKED_SET | 5 4 1 8 2 9 6 3 7 | 0.9749 | 0.3313 | 0.0000 | 0.0000
LINKED_TREE | 5 4 8 3 7 1 2 9 6 | 0.9992 | 0.6490 | 0.0000 | 0.0000
MULTI_ARRAY_LIST | 5 1 4 8 9 6 3 2 7 | 0.9951 | 0.2402 | 0.0000 | 0.0000
PART_SORTED_SET | 5 4 3 8 1 2 9 6 7 | 0.9976 | 0.1747 | 0.0000 | 0.0000
PART_SORTED_TWO_WAY_LIST | 5 4 8 1 3 2 9 6 7 | 0.9851 | 0.4510 | 0.0000 | 0.0000
SORTED_TWO_WAY_LIST | 5 4 3 8 1 2 9 6 7 | 0.9790 | 0.4763 | 0.0000 | 0.0000
SUBSET_STRATEGY_TREE | 5 3 4 1 8 2 9 6 7 | 0.9871 | 0.2066 | 0.0000 | 0.0000
TWO_WAY_CIRCULAR | 5 3 4 8 1 2 9 6 7 | 0.9953 | 0.2898 | 0.0000 | 0.0000
TWO_WAY_CURSOR_TREE | 5 4 1 2 9 8 6 3 7 | 0.9897 | 0.4469 | 0.0000 | 0.0000
TWO_WAY_LIST | 5 4 1 8 2 9 6 3 7 | 0.9862 | 0.1899 | 0.0000 | 0.0000
TWO_WAY_SORTED_SET | 5 4 1 8 3 2 9 6 7 | 0.9933 | 0.3427 | 0.0000 | 0.0000
TWO_WAY_TREE | 4 5 8 3 7 1 2 9 6 | 0.9993 | 0.7493 | 0.0000 | -0.0191
Summary | 69% (Φ5 is best) | | | Mean 0.0097 | Mean -0.0156

to conclusively determine what the exact magnitude and the 
ultimate causes of these differences are: possible candidates 
are the fault density of the software tested, the details of the 
algorithms implemented by the testing tools (AutoTest and 
YETI), and the availability of contracts to detect faults. 



3. DISCUSSION AND THREATS 
3.1 Towards a justification for the law 

Arcuri et al. [3] model random testing as a coupon collector
problem [22]: "Consider a box which contains $N$ types of
numerous objects. An object in the box is repeatedly sampled
on a random basis. Let $p_i > 0$ denote the probability
that a type-$i$ object is sampled. The successive samplings
are statistically independent and the sampling probabilities,
$p_1, p_2, \ldots, p_N$, are fixed. When a type-$i$ object is sampled
for the first time, we say that a type-$i$ object is detected. To
find the number of samplings required for detecting a set of
object types (say, object types indexed by $i = 1, \ldots, n \le N$)
is traditionally called coupon collector problem." In random
testing, the objects are test cases $U$, and each type $U_i \subseteq U$ is
a testing target: a unique failure or fault in our experiments.

Following [22], let $\tau_i$ be a random variable denoting the
number of samplings required to draw a test case in target
$U_i$, and let $\tau(n)$ be $\max\{\tau_1, \ldots, \tau_n\}$, for $n \le N$, that is the
number of samplings to cover all the first $n$ targets (in any
order). It is possible to prove that the expected value of
$\tau(n)$ is:

$$E[\tau(n)] = \sum_{1 \le i \le n} (-1)^{i+1} \sum_{J : |J| = i} \frac{1}{\sum_{j \in J} p_j} \tag{12}$$

The inverse $\phi(k)$ of $\tau(n)$ is a function from the number of
test cases to the expected number of failures, hence it corresponds
to an analytic version of $\phi_{\mathrm{mean}}(k)$.
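
The expected number of samplings can be evaluated exactly for small examples by summing over all non-empty subsets of targets with alternating signs, as in the inclusion-exclusion formula for the coupon collector problem with unequal probabilities. The sketch below is illustrative; the function name `expected_draws` and the example probabilities are assumptions.

```python
from fractions import Fraction
from itertools import combinations

def expected_draws(probs):
    """Expected number of samplings until every target in probs has been hit.

    probs holds the sampling probabilities of the n targets of interest.
    The sum ranges over all non-empty subsets J, with sign (-1)**(|J| + 1)
    and term 1 / (sum of the probabilities in J).
    """
    total = Fraction(0)
    for size in range(1, len(probs) + 1):
        sign = 1 if size % 2 == 1 else -1
        for subset in combinations(probs, size):
            total += sign / sum(subset)
    return total

# Three equally likely targets, each hit with probability 1/10 per draw.
equal = expected_draws([Fraction(1, 10)] * 3)
# Two targets with different probabilities.
unequal = expected_draws([Fraction(1, 2), Fraction(1, 4)])
```

The number of subsets grows as $2^n$, so this direct evaluation is only practical for a handful of targets; for larger $n$ one would rely on asymptotic approximations or on simulation.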

It is possible to approximate $\tau(n)$ for two special cases
of probabilities $p_i$'s: when they are all equal ($p_1 = p_2 =
\cdots = p_N = \theta$), and when they are exponentially decreasing
($p_i = \theta/10^{i-1}$). We can show that, in the first case, $\tau(n)$ is
$\Theta(\log^h(n))$ for some constant $h$, hence $\phi(k)$ is
$\Theta(\exp(gk))$ for some constant $g$; in the second case, $\tau(n)$ is
$\Theta(\exp(hn))$, hence $\phi(k)$ is $\Theta(\log^h(k))$.
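
As a sanity check on the two regimes, the order of growth of the expected number of samplings can be bounded directly. The derivation below is an illustrative sketch, not the authors' proof; it assumes that the equal case has every target sampled with the same probability $\theta$ and that the decreasing case has $p_i = \theta / 10^{i-1}$.

```latex
% Equal probabilities: while j targets are still missing, a draw hits a new
% one with probability j\theta, so the waiting times are geometric and add up.
E[\tau(n)] \;=\; \sum_{j=1}^{n} \frac{1}{j\,\theta}
          \;=\; \frac{H_n}{\theta}
          \;=\; \Theta(\log n)

% Exponentially decreasing probabilities: \tau(n) is at least the waiting time
% \tau_n for the rarest target and at most the sum \tau_1 + \dots + \tau_n.
\frac{10^{\,n-1}}{\theta}
  \;\le\; E[\tau(n)]
  \;\le\; \sum_{i=1}^{n} \frac{10^{\,i-1}}{\theta}
  \;=\; \frac{10^{\,n} - 1}{9\,\theta}
\qquad\Longrightarrow\qquad
E[\tau(n)] = \Theta\!\left(e^{\,n \ln 10}\right)
```

Inverting the first growth rate gives a number of detected targets that grows exponentially in the number of draws (until all targets are found), while inverting the second gives logarithmic growth, which is the regime compatible with the slowly rising fault curves.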

This might provide a partial explanation for why the faults 
in AutoTest follow a poly-logarithmic curve: the distribution
of faults has exponentially decreasing probability. It may 
also justify some of the moderate differences in the Java 
experiments, if the fault distribution is different in the Java 
code with respect to the Eiffel code analyzed. 
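
To see how such a distribution could give rise to slow, logarithmic-looking growth, one can compute the expected number of distinct targets detected after k independent samplings, which is the sum over all targets of 1 - (1 - p_i)^k. This quantity is related to, but not the same as, the inverse of the expected value of τ(n). The sketch below is only an illustration: θ = 0.1, the eight targets, and the function name `expected_detected` are assumptions.

```python
def expected_detected(p, k):
    """Expected number of distinct targets detected after k samplings.

    Target i is still undetected after k independent draws with
    probability (1 - p_i)^k, so it contributes 1 - (1 - p_i)^k.
    """
    return sum(1.0 - (1.0 - p_i) ** k for p_i in p)


theta = 0.1       # illustrative value
num_targets = 8   # illustrative value

equal = [theta] * num_targets
decreasing = [theta / 10 ** i for i in range(num_targets)]

for k in (10, 100, 1_000, 10_000):
    print(k,
          round(expected_detected(equal, k), 2),
          round(expected_detected(decreasing, k), 2))
```

Under these assumptions, the equal-probability curve saturates within about a hundred samplings, while the exponentially decreasing case gains roughly one additional target per tenfold increase in k, that is, it grows like log(k).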

A more general problem with this explanation concerns how
the coupon collector model applies to random testing of
object-oriented programs as it is available in AutoTest and
YETI. Such tools generate test cases dynamically by
incrementally populating a pool of objects with random calls.
This behavior entails that the probability of sampling an
object with certain characteristics (and hence of constructing
a test case that belongs to a certain target) is not fixed but
varies dynamically, and successive draws are not statistically
independent. For example, the first round can only
successfully call creation procedures, and the probability of
constructing a test case involving any other routine is zero.
Hence, test cases are not really sampled uniformly at random,
and the problem is to determine the connection between the
real process and its representation as a coupon collection
process. All we can say with the currently available data is
that object-oriented random testing with AutoTest seems to be
describable as a coupon collection process with a probability
distribution over faults that is always exponentially
decreasing; this is not quite the same as saying that the
distribution of bugs is exponentially decreasing.

3.2 Threats to validity 

Threats to construct validity are present in the Java ex- 
periments where we have no reliable way of reconstructing 
faults from failures; a similar threat occurs with the Eiffel ex- 
periments if the contracts incorrectly capture the designer's 
intended behavior. However, even if we may be measuring
different things, we are arguably still measuring things that
significantly correlate; the results partially confirm this.

The very large number of repeated experiments should
have reduced the potential threats to internal validity -
referring to the possibility that the study does not support the
claimed findings - to the minimum [2]. As discussed in
Section 2.3, however, the Java experiments are somewhat less
clear-cut than the Eiffel experiments; further experiments
will hopefully provide conclusive evidence.

External validity - which refers to the findings'
generalizability - is limited by the focus on random testing and by
the availability of software that can be tested with this
technique. This limitation is largely a consequence of the need
to design experiments approachable with the currently
available technology. In future work, we will target more
software with contracts (e.g., JML and .NET code equipped
with CodeContracts) and more testing frameworks (see
Section 4) to improve the generalizability of our findings.

4. RELATED WORK 

Automated random testing is a technique that is inexpensive
to run and has been shown to find bugs in different contexts,
including Java libraries and applications [18] [9] [5l]; Eiffel
libraries pi; and Haskell programs m. For brevity, we only
succinctly mention the most closely related work.

YETI and AutoTest are two representatives in a series of
random testing tools developed during the last decade,
including Randoop [19], JCrasher [9], Eclat [18], Jtest [20],
Jartege [15], and RUTE-J [1].

Arcuri et al.'s theoretical analysis of random testing [3] is
an important basis for understanding the practical and
empirical side as well. Section 3.1 outlined the connection between the
work in [3] and the present paper's, as well as the points that
still remain open. Our preliminary results hint at a plausible
consistency between the two results (the theoretical and
the empirical). To our knowledge, [16] is the only other
empirical result about random testing; [16] considers only one
model, which we have included as Φ1 in our analysis.
6. APPENDIX



6.1 Experiments with Eiffel code 

Table 5 displays the statistics of the comparison between Φ5 and the other models according to their R² scores in the fittings
across all Eiffel classes used in the experiments. The p-values characterize whether differences are statistically significant. The
effect size (computed with the Z statistics as Z/√(2N), where N is the number of classes, and hence 2N is the total sample
size) characterizes the magnitude of the observed differences: an effect size of 0.1 is small; 0.3 is medium; 0.5 is large.
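
The effect-size computation can be sketched in a few lines of Python. The code below is a simplified illustration, not the procedure used for the tables: it assumes the normal approximation of the Wilcoxon signed-rank statistic without tie or continuity correction, and the function name and the sample R² scores are hypothetical.

```python
import math


def wilcoxon_effect_size(xs, ys):
    """Wilcoxon signed-rank test on paired scores (normal approximation).

    Returns (Z, two-sided p-value, effect size |Z| / sqrt(2N)), where N is
    the number of pairs. Zero differences are dropped and tied absolute
    differences share their average rank; the variance is not corrected
    for ties, which is a simplification.
    """
    n_pairs = len(xs)
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while (end + 1 < n
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1

    # Sum of the ranks of the positive differences, then standardize.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    effect = abs(z) / math.sqrt(2.0 * n_pairs)
    return z, p_value, effect


# Hypothetical R^2 scores of two models over the same eight classes.
r2_phi5 = [0.99, 0.97, 0.95, 0.98, 0.96, 0.94, 0.93, 0.99]
r2_other = [0.90, 0.95, 0.96, 0.91, 0.91, 0.91, 0.97, 0.93]
z, p_value, effect = wilcoxon_effect_size(r2_phi5, r2_other)
print(f"Z={z:.3f}  p={p_value:.4f}  effect size={effect:.3f}")
```

Statistical packages typically offer exact variants of the test for small samples, so p-values obtained this way may differ slightly from those reported.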



Table 5: Eiffel classes: differences between Φ5 and the other Φi's (Wilcoxon signed-rank test).

Statistics   Φ1          Φ2          Φ3          Φ4          Φ6          Φ7          Φ8          Φ9
p-value      2.95E-03    28.46E-03   19.53E-06   1.11E-06    52.55E-09   71.79E-09   4.12E-06    61.45E-09
effect size  324.32E-03  239.05E-03  465.92E-03  531.39E-03  593.82E-03  587.73E-03  502.46E-03  590.77E-03



6.2 Experiments with Java code 

Tables 6 and 7 show the complete statistics about the testing sessions with YETI on the Java classes described in Section 2.3.
They are the extended versions of Tables 3 and 4.



Table 6: Java classes tested for failures: statistics per class.

Class                       S  T          F  EM             E[γ]  E[Δ]      σ[Δ]
java.lang.Character        30  1.29E+05  32  9.91E-01   7.17E-01  1.24E-02  6.07E-02
java.lang.String           30  1.31E+05  84  7.58E-01  -1.34E+00  2.15E-02  1.04E-01
java.util.ArrayList        30  1.08E+05  10  4.84E-01  -1.01E+00  1.57E-03  1.85E-02
java.util.Calendar         30  1.35E+05  36  1.31E+00   4.21E-01  1.28E-02  3.89E-02
java.util.Date             30  1.24E+05   8  6.69E-02  -9.06E-01  3.31E-03  1.95E-02
java.util.HashMap          31  1.13E+05   7  2.84E-01  -3.36E+00  9.01E-04  8.89E-03
java.util.Hashtable        30  1.16E+05  14  4.42E-01  -1.45E+00  2.27E-03  2.48E-02
java.util.LinkedList       30  1.08E+05  12  6.70E-02  -1.31E+00  5.46E-03  4.55E-02
java.util.Properties       30  1.18E+05  15  1.25E+00   5.34E-01  7.16E-03  2.84E-02
java.util.SimpleTimeZone   30  1.63E+05  80  2.05E+00  -6.33E-02  1.93E-02  7.63E-02
java.util.TreeMap          30  1.07E+05  58  8.56E-01  -2.03E+00  2.63E-02  8.18E-02
Mean                       30  1.23E+05  32  7.78E-01  -8.91E-01  1.03E-02  4.61E-02
Median                     30  1.18E+05  15  7.58E-01  -1.01E+00  7.16E-03  3.89E-02
Stdev                          1.65E+04  29  6.02E-01   1.23E+00  8.86E-03  3.07E-02



Table 7: Java classes tested for faults: statistics per class.

Class                             S  T          F  EH             E[γ]  E[Δ]      σ[Δ]
java.lang.Boolean                 4  2.49E+05      0.00E+00        NaN  0.00E+00  0.00E+00
java.lang.Byte                    4  2.79E+05   2  4.10E-03   0.00E+00  3.00E-03  2.72E-02
java.lang.Character               4  3.67E+05      0.00E+00        NaN  0.00E+00  0.00E+00
java.lang.Class                   4  2.39E+05   8  4.39E-01   1.07E-01  8.79E-03  5.34E-02
java.lang.ClassLoader             4  2.66E+05   8  2.82E-02  -3.85E-01  8.50E-03  6.36E-02
java.lang.Compiler                4  1.15E+05      0.00E+00        NaN  0.00E+00  0.00E+00
java.lang.Double                  4  2.70E+05   4  2.23E-02  -1.04E-01  7.13E-03  5.93E-02
java.lang.Enum                    4  4.29E+06      0.00E+00        NaN  0.00E+00  0.00E+00
java.lang.Float                   4  2.71E+05   3  9.52E-03  -4.40E-01  5.26E-03  5.10E-02
java.lang.InheritableThreadLocal  4  2.23E+05      0.00E+00        NaN  0.00E+00  0.00E+00
java.lang.Integer                 4  3.06E+05   2  5.95E-03   1.10E-01  3.16E-03  3.43E-02
java.lang.Long                    4  3.08E+05   2  8.62E-03  -2.66E-01  2.86E-03  2.66E-02
java.lang.Math                    4  3.28E+05      0.00E+00        NaN  0.00E+00  0.00E+00
java.lang.Number                  4  0.00E+00      0.00E+00        NaN  0.00E+00  0.00E+00
java.lang.Object                  4  2.23E+05      0.00E+00        NaN  0.00E+00  0.00E+00
java.lang.Package                 4  3.36E+06   1  4.40E-04   0.00E+00  3.22E-04  8.96E-03
java.lang.Process                 4  0.00E+00      0.00E+00        NaN  0.00E+00  0.00E+00
java.lang.ProcessBuilder          4  2.61E+05   3  1.07E-02  -4.48E-01  4.39E-03  3.80E-02
java.lang.Runtime                 4  3.35E+05  10  7.92E-01  -3.78E-01  4.80E-04  1.74E-02
java.lang.RuntimePermission       4  2.52E+05      0.00E+00        NaN  0.00E+00  0.00E+00
java.lang.Short                   4  2.84E+05   2  7.31E-03  -3.11E-01  2.89E-03  2.99E-02
java.lang.StackTraceElement       1  2.12E+05      0.00E+00        NaN  0.00E+00  0.00E+00
java.lang.StrictMath              4  3.28E+05      0.00E+00        NaN  0.00E+00  0.00E+00
java.lang.String                  4  3.45E+05   4  2.91E-02  -3.32E-01  3.31E-03  3.03E-02
java.lang.StringBuffer            1  7.30E+03   4  0.00E+00        NaN  0.00E+00  2.12E-01
java.lang.StringBuilder           1  1.69E+03   6  0.00E+00        NaN  0.00E+00  2.97E-01
java.lang.ThreadLocal             3  2.21E+05      0.00E+00        NaN  0.00E+00  0.00E+00
java.lang.Throwable               3  2.38E+05      0.00E+00        NaN  0.00E+00  0.00E+00
java.lang.Void                    3  0.00E+00      0.00E+00        NaN  0.00E+00  0.00E+00
Mean                              4  4.68E+05   2  4.68E-02        NaN  1.73E-03  3.27E-02
Median                            4  2.61E+05      0.00E+00        NaN  0.00E+00  0.00E+00
Stdev                             1  9.45E+05   3  1.65E-01        NaN  2.72E-03  6.58E-02



Tables 8 and 9 show the results of fitting the models Φ1, Φ2, Φ4, Φ5 on the mean fault count φ_mean(k) curves, reporting the
same data as for the AutoTest experiments in Section 2.2.

Tables 10 and 11 report the same statistics as Table 5 for the Java classes (failure and fault data).
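
As an illustration of how fit-quality scores of this kind can be obtained, the Python sketch below fits a poly-logarithmic curve and computes its R² and RMSE. Several points are assumptions rather than the procedure behind the tables: the model form a·log(k+1)^b, the fit by linear least squares in log-log space, the function names, and the synthetic data. The `r_squared` helper returns NaN or -Inf when the data are constant, which is one plausible reading of the NaN and -Inf entries in the fault tables for classes whose curve never leaves zero.

```python
import math


def fit_polylog(ks, ys):
    """Fit y = a * log(k + 1)^b by least squares in log-log space.

    Requires k >= 1 and y > 0 so that both logarithms are defined.
    Minimizing the error in log space is a simplification; it is not
    the same as nonlinear least squares on the original scale.
    """
    us = [math.log(math.log(k + 1.0)) for k in ks]
    vs = [math.log(y) for y in ys]
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    b = (sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
         / sum((u - mean_u) ** 2 for u in us))
    a = math.exp(mean_v - b * mean_u)
    return a, b


def rmse(ys, predictions):
    """Root mean squared error between data and model predictions."""
    return math.sqrt(
        sum((y - f) ** 2 for y, f in zip(ys, predictions)) / len(ys))


def r_squared(ys, predictions):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0.0:
        # Constant data (e.g. no fault ever found): R^2 is undefined.
        return float("nan") if ss_res == 0.0 else float("-inf")
    return 1.0 - ss_res / ss_tot


# Synthetic curve generated from the assumed model with a = 2, b = 1.5.
ks = [10, 100, 1_000, 10_000, 100_000]
ys = [2.0 * math.log(k + 1.0) ** 1.5 for k in ks]
a, b = fit_polylog(ks, ys)
predictions = [a * math.log(k + 1.0) ** b for k in ks]
print(a, b, r_squared(ys, predictions), rmse(ys, predictions))
```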
Table 8: Testing of Java classes for failures: best fits with mean.

Class                     Best fit ranking  R²_best   RMSE_best  Δ(R²)     Δ(RMSE)
java.lang.Character       1 5 4 2           9.87E-01  3.49E-01   4.89E-02  4.24E-01
java.lang.String          1 5 4 2           9.94E-01  7.37E-01   3.23E-02  1.14E+00
java.util.ArrayList       2 1 5 4           9.75E-01  9.76E-02   2.01E-01  1.96E-01
java.util.Calendar        1 5 4 2           9.91E-01  5.29E-01   5.85E-03  1.56E-01
java.util.Date            2 1 5 4           9.78E-01  1.08E-01   1.14E-01  1.61E-01
java.util.HashMap         1 5 4 2           9.09E-01  1.73E-01   5.71E-02  4.78E-02
java.util.Hashtable       2 1 5 4           9.62E-01  1.44E-01   2.54E-01  2.54E-01
java.util.LinkedList      2 1 5 4           7.41E-01  4.26E-01   1.59E-01  1.15E-01
java.util.Properties      5 1 4 2           9.80E-01  2.34E-01   0.00E+00  0.00E+00
java.util.SimpleTimeZone  2 5 1 4           9.98E-01  3.81E-01   8.87E-03  5.93E-01
java.util.TreeMap         1 5 2 4           9.97E-01  5.04E-01   3.14E-03  2.05E-01
Mean                      9% (Φ5 is best)                        8.04E-02  3.00E-01
                          64% (Φ5 is 2nd best)
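
The two percentages in the Mean row can be recomputed from the ranking column. The sketch below assumes that each ranking lists model indices from best to worst fit and that "2nd best" counts the classes where Φ5 is ranked first or second; both readings are assumptions, chosen because they reproduce the reported 9% and 64%. The helper name `share_ranked_within` is hypothetical.

```python
# Best-fit rankings for the Java failure experiments (model indices,
# assumed to be ordered from best to worst fit).
rankings = {
    "java.lang.Character": [1, 5, 4, 2],
    "java.lang.String": [1, 5, 4, 2],
    "java.util.ArrayList": [2, 1, 5, 4],
    "java.util.Calendar": [1, 5, 4, 2],
    "java.util.Date": [2, 1, 5, 4],
    "java.util.HashMap": [1, 5, 4, 2],
    "java.util.Hashtable": [2, 1, 5, 4],
    "java.util.LinkedList": [2, 1, 5, 4],
    "java.util.Properties": [5, 1, 4, 2],
    "java.util.SimpleTimeZone": [2, 5, 1, 4],
    "java.util.TreeMap": [1, 5, 2, 4],
}


def share_ranked_within(rankings, model, position):
    """Percentage of classes where `model` is among the first `position` fits."""
    hits = sum(1 for order in rankings.values() if model in order[:position])
    return round(100.0 * hits / len(rankings))


best = share_ranked_within(rankings, model=5, position=1)
second = share_ranked_within(rankings, model=5, position=2)
print(f"{best}% best, {second}% first or second")
```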



Table 9: Testing of Java classes for faults: best fits with mean.

Class                             Best fit ranking  R²_best   RMSE_best  Δ(R²)     Δ(RMSE)
java.lang.Boolean                 1 2 4 5                     5.23E-10   NaN       4.01E-05
java.lang.Byte                    1 5 4 2           9.01E-01  4.33E-02   1.82E-01  2.99E-02
java.lang.Character               1 2 4 5           -Inf      4.91E-12   NaN       2.41E-05
java.lang.Class                   2 1 5 4           9.58E-01  1.98E-01   1.37E-02  2.91E-02
java.lang.ClassLoader             1 5 2 4           9.42E-01  1.98E-01   1.46E-01  1.73E-01
java.lang.Compiler                1 2 4 5           -Inf      2.45E-11   NaN       4.00E-05
java.lang.Double                  1 5 4 2           9.28E-01  1.01E-01   1.61E-01  8.12E-02
java.lang.Enum                    1 2 4 5           -Inf      2.52E-11   NaN       8.31E-06
java.lang.Float                   2 1 5 4           9.80E-01  3.01E-02   1.97E-01  6.91E-02
java.lang.InheritableThreadLocal  1 2 4 5           -Inf      3.60E-12   NaN       4.27E-05
java.lang.Integer                 2 1 5 4           9.85E-01  2.31E-02   3.02E-01  8.20E-02
java.lang.Long                    2 1 5 4           9.85E-01  2.10E-02   2.10E-01  6.05E-02
java.lang.Math                    1 2 4 5           -Inf      6.39E-11   NaN       4.07E-05
java.lang.Number                  1 2 4 5           NaN       0.00E+00   NaN       1.18E-04
java.lang.Object                  1 2 4 5           -Inf      3.41E-12   NaN       3.60E-05
java.lang.Package                 2 1 5 4           9.94E-01  1.84E-03   2.16E-01  9.73E-03
java.lang.Process                 1 2 4 5           NaN       0.00E+00   NaN       1.30E-04
java.lang.ProcessBuilder          1 5 4 2           9.72E-01  4.48E-02   1.22E-01  5.91E-02
java.lang.Runtime                 5 1 4 2           8.34E-01  1.35E-01   0.00E+00  0.00E+00
java.lang.RuntimePermission       1 2 4 5           -Inf      1.63E-11   NaN       3.18E-05
java.lang.Short                   2 1 5 4           9.94E-01  1.25E-02   2.60E-01  7.12E-02
java.lang.StackTraceElement       1 2 4 5           -Inf      4.44E-10   NaN       2.05E-05
java.lang.StrictMath              1 2 4 5           -Inf      2.20E-10   NaN       2.82E-05
java.lang.String                  1 5 4 2           9.72E-01  6.20E-02   9.72E-02  7.00E-02
java.lang.StringBuffer            2 1 5 4           1.00E+00  3.58E-02   1.08E-03  3.34E-02
java.lang.StringBuilder           2 5 1 4           9.91E-01  7.40E-02   1.34E-03  4.87E-03
java.lang.ThreadLocal             1 2 4 5           -Inf      5.17E-12   NaN       4.28E-05
java.lang.Throwable               1 2 4 5           -Inf      2.88E-12   NaN       1.71E-05
java.lang.Void                    1 2 4 5           NaN       0.00E+00   NaN       1.59E-04
Mean                              3% (Φ5 is best)                        NaN       2.67E-02
                                  24% (Φ5 is 2nd best)



Table 10: Java failures: differences between Φ5 and some other Φi's (Wilcoxon signed-rank tests).

Statistics   Φ1          Φ2          Φ4
p-value      18.55E-03   123.05E-03  976.56E-06
effect size  492.85E-03  341.21E-03  625.54E-03



Table 11: Java faults: differences between Φ5 and some other Φi's (Wilcoxon signed-rank tests).

Statistics   Φ1          Φ2          Φ4
p-value      20.26E-03   855.22E-03  122.07E-06
effect size  300.87E-03  28.85E-03   432.75E-03

6.3 Eiffel experiments: all graphs

Figures 1 to 7 display, for each of the 42 Eiffel classes tested, the mean curve (in black) and the three models that fit best.
Horizontal axes are scaled by millions of test cases drawn; vertical axes by total number of faults found. 

Figure 5: Top 3 fits with mean for six Eiffel classes.

6.4 Java failure experiments: all graphs

Figures 8 and 9 display, for each of the 11 Java classes tested for failures, the mean curve (in black) and the three models that
fit best. Horizontal axes are scaled by tens of thousands of test cases drawn; vertical axes by total number of failures found. 

6.5 Java fault experiments: all graphs

Figures 10 to 12 display, for each of the 29 Java classes tested for faults where at least one fault was found, the mean curve
(in black) and the three models that fit best. Horizontal axes are scaled by hundreds of thousands of test cases drawn (except
for class java.lang.StringBuilder, which is not scaled); vertical axes by total number of faults found. The last values of the
mean curve for StringBuffer and StringBuilder are measurement errors that should be ignored.

Figure 10: Top 3 fits with mean for six Java classes (faults).